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Abstract 

Kernel methods are among the most popular techniques in machine learning. From a regularization perspec- 
tive they play a central role in regularization theory as they provide a natural choice for the hypotheses space and 
the regularization functional through the notion of reproducing kernel Hilbert spaces. From a probabilistic per- 
spective they are the key in the context of Gaussian processes, where the kernel function is known as the covariance 
function. Traditionally, kernel methods have been used in supervised learning problem with scalar outputs and 
indeed there has been a considerable amount of work devoted to designing and learning kernels. More recently 
there has been an increasing interest in methods that deal with multiple outputs, motivated partly by frameworks 
like multitask learning. In this paper, we review different methods to design or learn valid kernel functions for 
multiple outputs, paying particular attention to the connection between probabilistic and functional methods. 
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1 Introduction 



Many modern applications of machine learning require solving several decision making or prediction problems 
and exploiting dependencies between the problems is often the key to obtain better results and coping with a lack 
of data (to solve a problem we can borrow strength from a distinct but related problem). 

In sensor networks, for example, missing signals from certain sensors may be predicted by exploiting their cor- 
relation with observed signals acquired from other sensors |72j. In geostatistics, predicting the concentration of 
heavy pollutant metals, which are expensive to measure, can be done using inexpensive and oversampled vari- 
ables as a proxy [37|. In computer graphics, a common theme is the animation and simulation of physically plausible 
humanoid motion. Given a set of poses that delineate a particular movement (for example, walking), we are faced 
with the task of completing a sequence by filling in the missing frames with natural-looking poses. Human move- 
ment exhibits a high-degree of correlation. Consider, for example, the way we walk. When moving the right leg 
forward, we unconsciously prepare the left leg, which is currently touching the ground, to start moving as soon 
as the right leg reaches the floor. At the same time, our hands move synchronously with our legs. We can exploit 
these implicit correlations for predicting new poses and for generating new natural-looking walking sequences 
11061 . In text categorization, one document can be assigned to multiple topics or have multiple labels (50). In all 
the examples above, the simplest approach ignores the potential correlation among the different output compo- 
nents of the problem and employ models that make predictions individually for each output. However, these 
examples suggest a different approach through a joint prediction exploiting the interaction between the different 
components to improve on individual predictions. Within the machine learning community this type of modeling 
is often broadly referred to to as multitask learning. Again the key idea is that information shared between different 
tasks can lead to improved performance in comparison to learning the same tasks individually. These ideas are 
related to transfer learning l97l l20l [121 [74l , a term which refers to systems that learn by transferring knowledge 
between different domains, for example: "what can we learn about running through seeing walking?" 

More formally, the classical supervised learning problem requires estimating the output for any given input 
x,; an estimator /*(x„) is built on the basis of a training set consisting of N input-output pairs S = (X, Y) = 
(xi, yi), . . . , (xjv, j/jv). The input space X is usually a space of vectors, while the output space is a space of scalars. 
In multiple output learning (MOL) the output space is a space of vectors; the estimator is now a vector valued 
function f . Indeed, this situation can also be described as the problem of solving D distinct classical supervised 
problems, where each problem is described by one of the components /i,...,/d off. As mentioned before, the 
key idea is to work under the assumption that the problems are in some way related. The idea is then to exploit 
the relation among the problems to improve upon solving each problem separately. 

The goal of this survey is twofold. First, we aim at discussing recent results in multi-output/ multi-task learning 
based on kernel methods and Gaussian processes providing an account of the state of the art in the field. Second, 
we analyze systematically the connections between Bayesian and regularization (frequentist) approaches. Indeed, 
related techniques have been proposed from different perspectives and drawing clearer connections can boost 
advances in the field, while fostering collaborations between different communities. 

The plan of the paper follows. In chapter[2]we give a brief review of the main ideas underlying kernel methods 
for scalar learning, introducing the concepts of regularization in reproducing kernel Hilbert spaces and Gaussian 
processes. In chapter [3] we describe how similar concepts extend to the context of vector valued functions and 
discuss different settings that can be considered. In chapters|4]and|5]we discuss approaches to constructing mul- 
tiple output kernels, drawing connections between the Bayesian and regularization frameworks. The parameter 
estimation problem and the computational complexity problem are both described in chapter [6] In chapter we 
discuss some potential applications that can be seen as multi-output learning. Finally we conclude in chapter [8] 
with some remarks and discussion. 

2 Learning Scalar Outputs with Kernel Methods 

To make the paper self contained, we will start our study reviewing the classical problem of learning a scalar 
valued function, see for example 1 100 40 10 82 1. This will also serve as an opportunity to review connections 
between Bayesian and regularization methods. 

As we mentioned above, in the classical setting of supervised learning, we have to build an estimator (e.g. 
a classification rule or a regression function) on the basis of a training set S = (X,Y) = (xi, yi), . . . , (xjv, j/jv). 
Given a symmetric and positive bivariate function k(-, ■), namely a kernel, one of the most popular estimators in 
machine learning is defined as 

/,(x») = k^(fc(X,X) + A7VI)- 1 Y, (1) 
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where fc(X, X) has entries fc(xj,Xj), Y = [yi, . . . , i/jv] t and k Xt = [fc(xi, x*), . . . , fc(xjv, x*)] , where x, is a new 
input point. Interestingly, such an estimator can be derived from two different, though, related perspectives. 

2.1 A Regularization Perspective 

We will first describe a regularization (frequentist) perspective (see |35 105 . 100 86 1). The key point in this setting 
is that the function of interest is assumed to belong to a reproducing kernel Hilbert space (RKHS), 

/. e H k . 

Then the estimator is derived as the minimizer of a regularized functional 

1 N 

±J2(f(^)-y^ 2 + M\f\\l (2) 

i=l 

The first term in the functional is the so called empirical risk and it is the sum of the squared errors. It is a measure 
of the price we pay when predicting /(x) in place of y. The second term in the functional is the (squared) norm 
in a RKHS. This latter concept plays a key role, so we review a few essential concepts (see 1871I61H05 25[). A 
RKHS Hk is a Hilbert space of functions and can be defined by a reproducing kernefl k : X x X — ► R, which 
is a symmetric, positive definite function. The latter assumption amounts to requiring the matrix with entries 
fe(xj, Xj) to be positive for any (finite) sequence (xj). Given a kernel k, the RKHS Hk is the Hilbert space such that 
the function fc(x, ■) belongs to belongs to Hk for all x € X and 

/(x) = </,fc(x,.)> fc , v/ew h , 

where (■, -) k is the inner product in Hk- 

The latter property, known as the reproducing property, gives the name to the space. Two further properties 
make RKHS appealing: 

• functions in a RKHS are in the closure of the linear combinations of the kernel at given points, /(x) = 

fe(xi,x)ci. This allows us to describe, in a unified framework, linear models as well as a variety of 
generalized linear models; 

• the norm in a RKHS can be written as 5^ . fc(x; , Xj)acj and is a natural measure of how complex is a function. 
Specific examples are given by the shrinkage point of view taken in ridge regression with linear models l40l 
or the regularity expressed in terms of magnitude of derivatives, as is done in spline models 1 105 1. 

In this setting the functional (|2j can be derived either from a regularization point of view |35 105 [ or from the 
theory of empirical risk minimization (ERM) 11001 . In the former, one observes that, if the space Hk is large 
enough, the minimization of the empirical error is ill-posed, and in particular it responds in an unstable manner 
to noise, or when the number of samples is low Adding the squared norm stabilizes the problem. The latter point 
of view, starts from the analysis of ERM showing that generalization to new samples can be achieved if there is a 
tradeoff between fitting and complexity^ of the estimator. The functional (|2) can be seen as an instance of such a 
trade-off. 

The explicit form of the estimator is derived in two steps. First, one can show that the minimizer of {2) can 
always be written as a linear combination of the kernels centered at the training set points, 

N 

/,(x*) = ^fc(x*,Xi)ci = k^c, 

i=l 

see for example I65II19I . The above result is the well known representer theorem originally proved in 1511 (see also 
f88l and (26l for recent results and further references). The explicit form of the coefficients c = [c\ , . . . , cn] t can 
be then derived by substituting for /* (x*) in (|2j. 

1 In the following we will simply write kernel rather than reproducing kernel. 
2 For example, a measure of complexity is the Vapnik-Chervonenkis dimension 1861 
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2.2 A Bayesian Perspective 



A Gaussian process (GP) is a stochastic process with the important characteristic that any finite number of random 
variables, taken from a realization of the GP, follows a joint Gaussian distribution. A GP is usually used as a prior 
distribution for functions ]82[. If the function / follows a Gaussian process we write 

where m is the mean function and k the covariance or kernel function. The mean function and the covariance 
function completely specify the Gaussian process. In other words the above assumption means that for any finite 
set X = {x„}£U if we let /(X) = [/(xi), . . . , /(xjv)] t then 

/(X)~Af(m(X),fc(X,X)), 

where m(X) = [m(xi), . . . , m(xjv)] T and fc(X, X) is the kernel matrix. In the following, unless otherwise stated, 
we assume that the mean vector is zero. 

From a Bayesian point of view, the Gaussian process specifies our prior beliefs about the properties of the func- 
tion we are modeling. Our beliefs are updated in the presence of data by means of a likelihood function, that relates 
our prior assumptions to the actual observations. This leads to an updated distribution, the posterior distribution, 
that can be used, for example, for predicting test cases. 

In a regression context, the likelihood function is usually Gaussian and expresses a linear relation between the 
observations and a given model for the data that is corrupted with a zero mean Gaussian noise, 

p(y|/,x,a 2 ) =A/-(/(x),a 2 ), 

where a 2 corresponds to the variance of the noise. Noise is assumed to be independent and identically distributed. 
In this way, the likelihood function factorizes over data points, given the set of inputs X and a . The posterior 
distribution can be computed analytically. For a test input vector x„, given the training data S = {X, Y}, this 
posterior distribution is given by, 

p(/(x»)|S, x», <j>) = JV(/,(x„), fc»(x», x»)), 

where <f> denotes the set of parameters which include the variance of the noise, a 2 , and any parameters from the 
covariance function fc(x, x'). Here we have 

/»(x») = kj,, (fc(X, X) + a 2 I)~ 1 Y, 

fc„(x„x„) = fc(x»,x„) -kj,(fc(x,x) + o- 2 rr 1 k x , 

and finally we note that if we are interested into the distribution of the noisy predictions, p(y(x*)|S, x„, <f>), it is 
easy to see that we simply have to add a 2 to the expression for the predictive variance (see l82l ). 

Figured] represents a posterior predictive distribution for a data vector Y with TV = 4. Data points are repre- 
sented as dots in the figure. The solid line represents the mean function predicted, /„ (x» ), while the shaded region 
corresponds to two standard deviations away from the mean. This shaded region is specified using the predicted 
covariance function, fc*(x*,x*). Notice how the uncertainty in the prediction increases as we move away from the 
data points. 

Equations for /*(x„) and fc*(x*, x*) are obtained under the assumption of a Gaussian likelihood, common in 
regression setups. For non-Gaussian likelihoods, for example in classification problems, closed form solutions are 
not longer possible. In this case, one can resort to different approximations, including the Laplace approximation 
and variational methods l82l . 

2.3 A Connection Between Bayesian and Regularization Point of Views 

Connections between regularization theory and Gaussian process prediction or Bayesian models for prediction 
have been pointed out elsewhere |78 105 82 1. Here we just give a very brief sketch of the argument. We restrict 
ourselves to finite dimensional RKHS. Under this assumption one can show that every RKHS can be described in 
terms of a feature map 11001 , that is a map $ : X — > R p , such that 

p 

fc(x,x') = ^$ J (x)<r(x'). 
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Figure 1: Example of a predictive posterior distribution inferred with N = 4. The solid line corresponds to the 
predictive mean, the shaded region corresponds to two standard deviations of the prediction. Dots are values of 
the output function Y. We have also included some samples from the posterior distribution, shown as dashed 
lines. 

In fact in this case one can show that functions in the RKHS with kernel k can be written as 

/„(x) = f>^'(x) = (w,$(x)) , and ||/ w || fc = ||w||. 

Then we can build a Gaussian process by assuming the coefficient w = w 1 , . . . , w p to be distributed according to a 
multivariate Gaussian distribution. Roughly speaking, in this case the assumption /* ~ QV(0, k) becomes 

w ~ jV(0,Ip) oc e H|w|12 . 

As we noted before if we assume a Gaussian likelihood we have 

P(Y|X, /) = Af(f0q, a 2 l D ) « e -^H/»C x )- Y lln , 

where / W (X) = ((w, $( Xl )) , . . . , <w, $(x n ))) and \\f w (X) - Y\\ 2 n = T,^ =1 ((w, $(x,)) - y,) 2 . Then the posterior 
distribution is proportional to 

e -(^rll/w(X)-Y||2 + || w || 2 )^ 

and we see that a maximum a posteriori estimate will in turn give the minimization problem defining Tikhonov 
regularization |98|, where the regularization parameter is now related to the noise variance. 
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We note that in regularization the squared error is often replaced by a more general error term ^(/( x 0i 2/0- 

In a regularization perspective, the loss function £ : R x R — > R + measure the error we incur when predicting /(x) 
in place of y. The choice of the loss function is problem dependent. Often used examples are the square loss, the 
logistic loss or the hinge loss used in support vector machines (see (86)). 

The choice of a loss function in a regularization setting can be contrasted to the choice of the likelihood in a 
Bayesian setting. In this context, the likelihood function models how the observations deviate from the assumed 
true model in the generative process. The notion of a loss function is philosophically different. It represents the 
cost we pay for making errors. In Bayesian modeling decision making is separated from inference. In the inference 
stage the posterior distributions are computed evaluating the uncertainty in the model. The loss function appears 
only at the second stage of the analysis, known as the decision stage, and weighs how incorrect decisions are 
penalized given the current uncertainty. However, whilst the two notions are philosophically very different, we 
can see that, due to the formulation of the frameworks, the loss function and the log likelihood provide the same 
role mathematically. 

The discussion in the previous sections shows that the notion of a kernel plays a crucial role in statistical 
modeling both in the Bayesian perspective (as the covariance function of a GP) and the regularization perspective 
(as a reproducing kernel). Indeed, for scalar valued problems there is a rich literature on the design of kernels 
(see for example 1 86 . 90 82] and references therein). In the next sections we show how the concept of a kernel can 
be used in multi-output learning problems. Before doing that, we describe how the concepts of RKHSs and GPs 
translate to the setting of vector valued learning. 

3 Learning Multiple Outputs with Kernels Methods 

In this chapter we discuss the basic setting for learning vector valued functions and related problems (multiclass, 
multilabel) and then describe how the concept of kernels (reproducing kernels and covariance function for GP) 
translate to this setting. 

3.1 Multi-output Learning 

The problem we are interested in is that of learning an unknown functional relationship / between an input space 
X , for example X — R p , and an output space R D . In the following we will see that the problem can be tackled 
either assuming that f belongs to reproducing kernel Hilbert space of vector valued functions or assuming that f 
is drawn from a vector valued Gaussian process. Before doing this we describe several related settings all falling 
under the framework of multi-output learning. 

The natural extension of the traditional (scalar) supervised learning problem is the one we discussed in the 
introduction, when the data are pairs S = (X, Y) = (xi, yi), . . . , (xat, j/jv). For example this is the typical setting 
for problems such as motion/velocity fields estimation. A special case is that of multi-category classification 
problem or multi-label problems, where if we have D classes each input point can be associated to a (binary) 
coding vector where, for example 1 stands for presence (0 for absence) of a class instance.The simplest example is 
the so called one vs all approach to multiclass classification which, if we have {1, . . . , D} classes, amounts to the 
coding i — > e;, where (e 4 ) is the canonical basis of R D . 

A more general situation is that where different outputs might have different training set cardinalities, different 
input points or in the extreme case even different input spaces. More formally, in this case we have a training set 
S d = (X d , Y d ) = (x dl i, yd,i), (x-d,N d ,yd,N d ) for each component f d , with d = 1,...,D, where the number of 
data associated with each output, (Nd) might be different and the input for a component might belong to different 
input space (Xd). 

The terminology used in machine learning often does not distinguish the different settings above and the term 
multitask learning is often used. In this paper we use the term multi-output learning or vector valued learning to 
define the general class of problems and use the term multi-task for the case where each component has different 
inputs. Indeed in this very general situation each component can be thought of as a distinct task possibly related 
to other tasks (components). In the geostatistics literature, if each output has the same set of inputs the model 
is called isotopic and heterotopic if each output to be associated with a different set of inputs 11041 . Heterotopic 
data is further classified into entirely heterotopic data, where the variables have no sample locations in common, 
and partially heterotopic data, where the variables share some sample locations. In machine learning, the partially 
heterotopic case is sometimes referred to as asymmetric multitask learning U12tl2Tl . 

The notation in the multitask learning scenario (heterotopic case) is a bit more involved. To simplify the nota- 
tion we assume that the number of data for each output is the same. Moreover, for the sake of simplicity sometimes 
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we restrict the presentation to the isotopic setting, though the models can usually readily be extended to the more 
general setting. We will use the notation X to indicate the collection of all the training input points, {X.j}f =1 , and 
S to denote the collection of all the training data. Also we will use the notation f (X) to indicate a vector valued 
function evaluated at different training points. This notation has slightly different meaning depending on the way 
the input points are sampled. If the input to all the components are the same then X = xi, . . . , xjv and f (X) = 
/i(xi), . . . , /d(xjv). If the input for the different components are different then X = {Xd}j =1 = Xi, . . . , Xd, 
where X d = {x d ,„}£Li and f(X) = (/i(x M ), . . . ,/i(xi,jv)), •• ., (/d(xd,i), . . . ,/ d (x d ,jv)). 



3.2 Reproducing Kernel for Vector Valued Function 

The definition of RKHS for vector valued functions parallels the one in the scalar, with the main difference that the 
reproducing kernel is now matrix valued, see for example ]65 19 1 . A reproducing kernel is a symmetric function 
K:^x^-> R DxD , such that for any x, x' K(x, x') is a positive semi-definite matrix. A vector valued RKHS is 
a Hilbert space H of functions f : X — > R D , such that for very c 6 K D , and xeX,K( x, x')c, as a function of x' 
belongs to T-L and moreover K has the reproducing property 

(f,K(.,x)c> K = f(x) T c, 

where {■, -)k is the inner product in H. 

Again, the choice of the kernel corresponds to the choice of the representation (parameterization) for the func- 
tion of interest. In fact any function in the RKHS is in the closure of the set of linear combinations 



f ( x ) = 5ZK(xi,x)cj, cj 6 R D , 

i=i 

where we note that in the above equation each term K(x;, x) is a matrix acting on a vector c,-. The norm in the 
RKHS typically provides a measure of the complexity of a function and this will be the subject of the next sections. 

Note that the definition of vector valued RKHS can be described in a component-wise fashion in the following 
sense. The kernel K can be described by a scalar kernel 7? acting jointly on input examples and task indices, that is 

(K(x,x')) d , d ' = fl((x,d),(x',<0), (3) 

where R is a scalar reproducing kernel on the space X x {1, . . . , D}. This latter point of view is useful while 
dealing with multitask learning, see ]28| for a discussion. 

Provided with the above concepts we can follow a regularization approach to define an estimator by minimiz- 
ing the regularized empirical error (|2j, which in this case can be written as 



D N 
£T7E(/;( X 0^) 2 + A||f! 



i=l !=1 
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(4) 



where f = (/i , . . . , fo ) ■ Once again the solution is given by the representer theorem (65l 

N 

f ( x ) = 5ZK(x 1 ,x)c», 



and the coefficient satisfies the linear system 

c= (K(X, X) + AJVI) _1 y, 



(5) 



where c, y are ND vectors obtained concatenating the coefficients and the output vectors, and K(X, X) is an 
ND x ND with entries (K(xj, Xj)) djd /, for i, j — 1, . . . , N and d, d! = 1, . . . , D (see for example (65)). More 
explicitly 



K(X,X) = 



(K(Xi,Xi))i,i 
(K(X 2 ,Xi)) 2 ,i 



.(K(Xd,Xi))d,i 



(K(Xi,X D ))i, D 
(K(X 2 ,X£>))a,£> 



(K(X D) X D )) fl , D . 



(6) 
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where each block (K(Xj, X.j)) iy j is an iV by iV matrix (here we make the simplifying assumption that each output 
has same number of training data). Note that given a new point x* the corresponding prediction is given by 

f(x*)=K^c, 

where K x „ e~R DxND has entries (K(x»,Xj)) dld i for j = 1, . . . , N and d, d' = 1, . . . , D. 

3.3 Gaussian Processes for Vector Valued Functions 

Gaussian process methods for modeling vector-valued functions follow the same approach as in the single output 
case. Recall that a Gaussian process is defined as a collection of random variables, such that any finite number 
of them follows a joint Gaussian distribution. In the single output case, the random variables are associated to 
a single process / evaluated at different values of x while in the multiple output case, the random variables are 
associated to different processes {fd}d=i> evaluated at different values of x |2l 3/1 1 1 021 . 
The vector-valued function f is assumed to follow a Gaussian process 

f~07>(m,K), (7) 

where m € R D is a vector which components are the mean functions {m ( j(x)}£ =1 of each output and K is a 
positive matrix valued function as in section [3721 The entries (K(x, x')) d d / in the matrix K(x, x') correspond to 
the covariances between the outputs /d(x) and f d > (x') and express the degree of correlation or similarity between 
them. 

For a set of inputs X, the prior distribution over the vector f (X) is given by 

f(X)~JV(m(X),K(X,X)), 

where m(X) is a vector that concatenates the mean vectors associated to the outputs and the covariance matrix 
K(X, X) is the block partitioned matrix in l|6j. Without loss of generality, we assume the mean vector to be zero. 

In a regression context, the likelihood function for the outputs is often taken to be Gaussian distribution, so 
that 

p(y|f,x,E)=Af(f(x),E), 

where E 6 H DxD is a diagonal matrix with element^] {a d } d=1 . 

For a Gaussian likelihood, the predictive distribution and the marginal likelihood can be derived analytically. 
The predictive distribution for a new vector x» is 1 82 1 

p(f(x.)|S,f ) x.,^)=^(t(x.),K.(x.,x.)) ) (8) 

with 

f,(x«) = Kj,(K(X,X) + S)- 1 y, 
K.(x„,x.) = K(x.,x„) - K x „ (K(X,X) + S)- 1 ^,, 

where S = E ® I N , K x „ e rDxatd hag entries (K(x„,x/)) d)d / for j = 1, . . . , N and d,d' = 1, ...,£>, and <j> 
denotes a possible set of hyperparameters of the covariance function K(x, x') used to compute K(X, X) and the 
variances of the noise for each output {a d } d=1 . Again we note that if we are interested into the distribution of 
the noisy predictions it is easy to see that we simply have to add PS to the expression of the prediction variance. 
The above expression for the mean prediction coincides again with the prediction of the estimator derived in the 
regularization framework. 

In the following chapters we describe several possible choices of kernels (covariance function) for multi-output 
problems. We start in the next chapter with kernel functions that clearly separate the contributions of input and 
output. We will see later alternative ways to construct kernel functions that interleave both contributions in a non 
trivial way. 

3 This relation derives from j/d(x) = /d(x) + e<i(x), for each d, where {td(x)}£ =1 are independent white Gaussian noise processes with 
variance a d . 
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4 Separable Kernels and Sum of Separable Kernels 



In this chapter we review a special class of multi-output kernel functions that can be formulated as a sum of 
products between a kernel function for the input space alone, and a kernel function that encodes the interactions 
among the outputs. We refer to this type of multi-output kernel functions as separable kernels and sum of separable 
kernels (SoS kernels). 

We consider a class of kernels of the form 

(K(x,x')) djd / = fe(x,x')fe T (d,d'), 

where k, kr are scalar kernels on X x X and {1, . . . , D} x {1, . . . , D}. 
Equivalently one can consider the matrix expression 

K(x,x') = fc(x,x')B, (9) 

where B is a D x D symmetric and positive semi-definite matrix. We call this class of kernels separable since, 
comparing to l|3), we see that the contribution of input and output is decoupled. 
In the same spirit a more general class of kernels is given by 

Q 

K(x,x') =^]fc,(x,x')B,. 

9 = 1 

For this class of kernels, the kernel matrix associated to a data set X has a simpler form and can be written as 

Q 

K(X,X) = ® MX,X), (10) 

q=l 

where ® represents the Kronecker product between matrices. We call this class of kernels sum of separable kernels 
(SoS kernels). 

The simplest example of separable kernel is given by setting kr(d, d!) = 5 d , d >, where 5^' is the Kronecker 
delta. In this case B = Ijv, that is all the outputs are treated as being unrelated. In this case the kernel matrix 
K(X, X), associated to some set of data X, becomes block diagonal. Since the off diagonal terms encode output 
relatedness. We can see that the matrix B encodes dependencies among the outputs. 

The key question is how to choose the scalar kernels {k q }® =1 and especially how to design, or learn, the matri- 
ces {B 9 }^ =1 . This is the subject we discuss in the next few sections. We will see that one can approach the problem 
from a regularization point of view, where kernels will be defined by the choice of suitable regularizers, or, from a 
Bayesian point of view, constructing covariance functions from explicit generative models for the different output 
components. As it turns out these two points of view are equivalent and allow for two different interpretations of 
the same class of models. 

4.1 Kernels and Regularizers 

In this section we largely follow the results in |64ll65ll27l and 0. A possible way to design multi-output kernels 
of the form (|9) is given by the following result. If K is given by {9) then is possible to prove that the norm of a 
function in the corresponding RKHS can be written as 

D 

iK= E B L*' (11) 
d,d'=l 

where B 1 " is the pseudoinverse of B and f = (/i, ■ ■ ■ , /d). The above expression gives another way to see why the 
matrix B encodes the relation among the components. In fact, we can interpret the right hand side in the above 
expression as a regularizer inducing specific coupling among different tasks {ft,ft')k with different weights given 
by B^ . This result says that any such regularizer induces a kernel of the form {9). We illustrate the above idea 
with a few examples. 
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Mixed Effect Regularizer Consider the regularizer given by 

R(f) = A u (c u £\\f t \\l+u,Dj2\\ft-^^f<\\l) (12) 
\ 1=1 1=1 g=l / 

where A u = 2 (i-ui)(i-u+uD) an d ^ u = (2 — 2lj + ljD). The above regularizer is composed of two terms: the 
first is a standard regularization term on the norm of each component of the estimator; the second forces each f t 
to be close to the mean estimator across the components, / = zL/fLi fi- The corresponding kernel imposes a 
common similarity structure between all the output components and the strength of the similarity is controlled by 
a parameter u>, 

K w (x,x') = fe(x,x')(wl + (l-w)Ir,) (13) 

where 1 is the D x D matrix whose entries are all equal to 1, and A: is a scalar kernel on the input space X. 
Setting lj = corresponds to treating all components independently and the possible similarity among them is 
not exploited. Conversely w = 1 is equivalent to assuming that all components are identical and are explained by 
the same function. By tuning the parameter u the above kernel interpolates between this two opposites cases. We 
note that from a Bayesian perspective B is a correlation matrix with all the off-diagonals equal to uj, which means 
that the output of the Gaussian process are exchangeable. 

Cluster Based Regularizer. Another example of regularizer, proposed in 1251 , is based on the idea of grouping 
the components into r clusters and enforcing the components in each cluster to be similar. Following |47|, let us 
define the matrix E as the D x r matrix, where r is the number of clusters, such that Ef jC = 1 if the component 
I belongs to cluster c and otherwise. Then we can compute the D x D matrix M = E(E T E)~ 1 E T such that 
Mi i9 = — if components I and q belong to the same cluster c, and m c is its cardinality, M.c, q = otherwise. 
Furthermore let 1(c) be the index set of the components that belong to cluster c. Then we can consider the following 
regularizer that forces components belonging to the same cluster to be close to each other: 

r r 

i? ( f ) = ei E E Wft-fc\\l+^m a \\f c \\l, (14) 
c=ieei(c) c=i 

where f c is the mean of the components in cluster c and ei , e-2 are parameters balancing the two terms. Straight- 
forward calculations show that the previous regularizer can be rewritten as -R(f) = ~}2e q (fi > fi)h> wnere 

Ge, q =e 1 8i q + (e 2 -ei)M t<q . (15) 
Therefore the corresponding matrix valued kernel is K(x, x') = fe(x, x')G^ . 

Graph Regularizer. Following [64 [92, we can define a regularizer that, in addition to a standard regulariza- 
tion on the single components, forces stronger or weaker similarity between them through a given D x D positive 
weight matrix M, 

1 ° ° 
R(f) = g E - f«WlM lq + J2 II/<II*Mm. (16) 

The regularizer J(/) can be rewritten as: 

D D 

J2 {\\MlM t . q -(f t J q ) k M e , q )+J2\\MlM e , e = 
i, q =i i=i 

D D D 

Eii^h'E( 1 + ^) m ^- E M*, a = 

1=1 9 = 1 £,9=1 

D 

E (/*./«>* L '.« ( 17 ) 

£,9=1 

where L = D M, with D £i9 = 5i iq (j2h=i M « + M e,g)- Therefore the resulting kernel will be K(x,x') = 

fc(x, x')L' , with fc(x, x') a scalar kernel to be chosen according to the problem at hand. 

In the next section we will see how models related to those described above can be derived from suitable 
generative models. 
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4.2 Coregionalization Models 



The use of probabilistic models and Gaussian processes for multi-output learning was pioneered and largely de- 
veloped in the context of geostatistics, where prediction over vector-valued output data is known as cokriging. 
Geostatistical approaches to multivariate modelling are mostly formulated around the "linear model of coregion- 
alization" (LMC) (49ll37l , that can be considered as a generative approach for developing valid covariance func- 
tions. Covariance functions obtained under the LMC assumption follow the form of a sum of separable kernels. 
We will start considering this model and then discuss how several models recently proposed in the machine learn- 
ing literature are special cases of the LMC. 

4.2.1 The Linear Model of Coregionalization 

In the linear model of coregionalization, the outputs are expressed as linear combinations of independent random 
functions. This is done in a way that ensures that the resulting covariance function (expressed jointly over all the 
outputs and the inputs) is a valid positive semidefinite function. Consider a set of D outputs {fd(^)}d=i with 
x G R p . In the LMC, each component fd is expressed as |49| 

Q 

/d( X ) = ^2 a d,qUq(*), 
3=1 

where the latent functions u q (x), have mean zero and covariance cov[u g (x), u q >(x.')] = k q (x, x') if q = q', and a d:q 
are scalar coefficients. The processes {u q {x)~\ q —\ are independent for q ^ q'. The independence assumption can 
be relaxed and such relaxation is presented as an extension in section l43l Some of the basic processes u g (x) and 
u q i (x') can have the same covariance fc 9 (x, x'), while remaining independent. 

A similar expression for {fd(x)}d=i can be written grouping the functions u q (x) which share the same covari- 
ance mesa 

Q «« 

■W X ) = Y Y a d, q Uq(X-), (18) 
(J=l 1=1 

where the functions m^(x), with q = 1, . . . , Q and i = 1, . . . , R q , have mean equal to zero and covariance 
cov[w^(x),u*/(x')] = fcq(x, x') if i = i' and q — q . Expression fl8t means that there are Q groups of functions 
u q (x) and that the functions u\ (x) within each group share the same covariance, but are independent. The cross 
covariance between any two functions /d(x) and f d i (x) is given in terms of the covariance functions for u q (x) 

Q Q Rq Rq 

CO V [/ d (x),/ d /(x')] = Y Y Y a d,q a d>, q > COV[w l 9 (x),M*/(x')]. 

q — 1 g' = l i = l i' — l 

The covariance cov[/d(x), f d i (x')] is given by (K(x, x')) did i. Due to the independence of the functions u q (x), the 
above expression reduces to 

Q R q Q 

(K(x, x'))d,d' = ^5Z a d.9 a d',9 fc <;( x ' x ') = Y b d,d' k i^ x '^ ( 19 ' 

q—1 i=l q—i 

with b q d d , — Ylf=i a \q a \' ,q- The kernel K(x, x') can now be expressed as 

Q 

K(x,x') = ^B 9 fc 9 (x,x'), (20) 

8=1 

where each B q G R DxD is known as a coregionalization matrix. The elements of each B 9 are the coefficients b q d 
appearing in equation JT9^ . The rank for each matrix B g is determined by the number of latent functions that share 
the same covariance function fc q (x, x'), that is, by the coefficient R q . 

Equation ( fT5t can be interpreted as a nested structure [104] in which the outputs f d (x) are first expressed as a 
linear combination of spatially uncorrelated processes f d (x) = Ylq=i /d( x )' w i m E[/J( x )] = and cov [/J (x), f d , (x')] = 
b q d d , k q (x, x') if q = q , otherwise it is equal to zero. At the same time, each process f d (x) can be represented as a set 
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of uncorrelated functions weighted by the coefficients a' d q , f% (x) = J^»=i a d,q u 9 ( x ) where again, the covariance 
function for Ug(x) is fc 9 (x, x'). 

Therefore, starting from a generative model for the outputs, the linear model of coregionalization leads to a sum 
of separable kernels that represents the covariance function as the sum of the products of two covariance functions, 
one that models the dependence between the outputs, independently of the input vector x (the coregionalization 
matrix B g ), and one that models the input dependence, independently of the particular set of functions {/d(x)} 
(the covariance function fc 9 (x,x')). The covariance matrix for f (X) is given by {Tol . 

4.2.2 Intrinsic Coregionalization Model 

A simplified version of the LMC, known as the intrinsic coregionalization model (ICM) (see (37l ), assumes that the 
elements b q d of the coregionalization matrix B 9 can be written as b d d , — v dtd ib q , for some suitable coefficients 
Vd,d' ■ With this form for b q d d , , we have 

Q Q 
cov[/d(x),/ d /(x')] = y]t>d,d/6 g fcg(x,x / ),= v dA i y^b g fcq(x,x') 

9 =1 9 =1 

= Wd,d'fc(x,x'), 

where fc(x, x') = Ylq=i b q k q (x,x'). The above expression can be seen as a particular case of the kernel function ob- 
tained from the linear model of coregionalization, with Q = 1. In this case, the coefficients v d . d ' — J2f=i a d,i a d',i ~ 
b d d'i M d the kernel matrix for multiple outputs becomes K(x, x') = fc(x, x')B as in {9). 
The kernel matrix corresponding to a dataset X takes the form 

K(X,X) =B®fc(X,X). (21) 

One can see that the intrinsic coregionalization model corresponds to the special separable kernel often used in the 
context of regularization. Notice that the value of R\ for the coefficients v dtd i = $^;=i dd,i a d',i = b d d ,, determines 
the rank of the matrix B. 

As pointed out by (37), the ICM is much more restrictive than the LMC since it assumes that each basic covari- 
ance fc q (x, x') contributes equally to the construction of the autocovariances and cross covariances for the outputs. 
However, the computations required for the corresponding inference are greatly simplified, essentially because of 
the properties of the Kronecker product. This latter point is discussed in detail in Section|6] 

It can be shown that if the outputs are considered to be noise-free, prediction using the intrinsic coregional- 
ization model under an isotopic data case is equivalent to independent prediction over each output 1411 . This 
circumstance is also known as autokrigeability 11041 . 

4.2.3 Comparison Between ICM and LMC 

We have seen before that the intrinsic coregionalization model is a particular case of the linear model of core- 
gionalization for Q — 1 (with R q / 1) in equation[l9] Here we contrast these two models. Note that a different 
particular case of the linear model of coregionalization is assuming R q — 1 (with Q ^ 1). This model, known in 
the machine learning literature as the semiparametric latent factor model (SLFM) |96 1, will be introduced in the 
next subsection. 

To compare the two models we have sampled from a multi-output Gaussian process with two outputs (D — 2), 
a one-dimensional input space (x 6 R) and a LMC with different values for R q and Q. As basic kernels fc 9 (x, x') 
we have used the exponentiated quadratic (EQ) kernel given as (82), 

fc 9 (x,x')=exp(-^^), 

where || || represents the Euclidian norm and £ q is known as the characteristic length-scale. The exponentiated 
quadratic is variously referred to as the Gaussian, the radial basis function or the squared exponential kernel. 

Figure |2] shows samples from the intrinsic coregionalization model for R q = 1, meaning a coregionalization 
matrix Bi of rank one. Samples share the same length-scale and have similar form. They have different variances, 
though. Each sample may be considered as a scaled version of the latent function, as it can be seen from equation 
[I8]with Q = 1 and R q = 1, 

fi(x) = ai^u^a;), / 2 (:r) = a\^u\{x), 
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Figure 2: Two samples from the intrinsic coregionalization model with rank one, this is R q = 1. Solid lines 
represent one of the samples, and dashed lines represent the other sample. Samples are identical except for scale. 

where we have used x instead of x for the one-dimensional input space. 

Figure [3] shows samples from an ICM of rank two. From equation[18j we have for Q = 1 and R q — 2, 



where u\ (x) and u\(x) are sampled from the same Gaussian process. Outputs are weighted sums of two different 
latent functions that share the same covariance. In contrast to the ICM of rank one, we see from figure[3]that both 
outputs have different forms, although they share the same length-scale. 

Figure |4] displays outputs sampled from a LMC with R q — 1 and two latent functions (Q = 2) with different 
length-scales. Notice that both samples are combinations of two terms, a long length-scale term and a short length- 
scale term. According to equation[l8] outputs are given as 



where ui(x) and u 2 (x) are samples from two Gaussian processes with different covariance functions. In a similar 
way to the ICM of rank one (see figure|2), samples from both outputs have the same form, this is, they are aligned. 

We have the additional case for a LMC with R q = 2 and Q — 2 in figure |5] According to equation [18] the 
outputs are give as 



fi(x) = al A ul(x) + al A ul(x), f 2 (x) 



<A,iu\{x) + a\^u\{x), 



= a\ tl u\{x) + a\ a u\(x), f 2 (x) = a\^u\{x) + a^ 2 ul(x), 



fi(x) = al^u^x) + a\ tl u\(x) + a\ >2 u\(x) + a\ a u%{x), 
h{x) = a\ tl u\(x) + a\ tl u\(x) + a 2 .2U 2 {x) + a% >2 u 2 (x), 
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Figure 3: Two samples from the intrinsic coregionalization model with rank two, R q = 2. Solid lines and dashed 
lines represent different samples. Although samples from different outputs have the same length-scale, they look 
different and are not simply scaled versions of one another. 



where the pair of latent functions u\(x) and u\{x) share their covariance function and the pair of latent functions 
u\{x) and u%(x) also share their covariance function. As in the case of the LMC with R q = 1 and Q = 2 in figure 
[4] the outputs are combinations of a term with a long length-scale and a term with a short length-scale. A key 
difference however, is that, for R q = 2 and Q = 2, samples from different outputs have different shapesQ 

4.2.4 Linear Model of Coregionalization in Machine Learning and Statistics 

The linear model of coregionalization has already been used in machine learning in the context of Gaussian pro- 
cesses for multivariate regression and in statistics for computer emulation of expensive multivariate computer 
codes. 

As we have seen before, the linear model of coregionalization imposes the correlation of the outputs explicitly 
through the set of coregionalization matrices. A simple idea used in the early papers of multi-output GPs for 
machine learning was based on the intrinsic coregionalization model and assumed B = Id- In other words, the 
outputs were considered to be conditionally independent given the parameters <fi. Correlation between the outputs 
was assumed to exist implicitly by imposing the same set of hyperparameters <p for all outputs and estimating 
those parameters, or the kernel matrix fc(X, X) directly, using data from all the outputs ] 66 55. 1 13 1 . 

4 Notice that samples from each output are not synchronized, meaning that the maximums and minimus do not always occur at the same 
input points. 
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Figure 4: Two samples from a linear model of coregionalization with R q = 1 and Q = 2. The solid lines represent 
one of the samples. The dashed lines represent the other sample. Samples are the weigthed sums of latent functions 
with different length-scales. 



In this section, we review more recent approaches for multiple output modeling that are different versions of 
the linear model of coregionalization. 

Semiparametric latent factor model. The semiparametric latent factor model (SLFM) proposed by |96| turns 
out to be a simplified version of the LMC. In fact it corresponds to setting R q = 1 in dT8t so that we can rewrite 
equation fTTJt as 

Q 

K(X, X) = ^2 a 9 aj ® fc«(X, X) , 

9 = 1 

where a q 6 R Cxl with elements {ad,q}d=i an d Q fixed. With some algebraic manipulations, that exploit the 
properties of the Kronecker product, we can write 

Q 

K(X,X) = ^(a, ®Iiv)fc 9 (X,X)(aJ ®l N ) = (A ® Ijv)K(A t ® Ijv), 

9 = 1 

where A G R D x Q is a matrix with columns a 9 and K G R f,?iv xQN is a block diagonal matrix with blocks given by 

Mx,x). 
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LMC with R q = 2 and Q = 2, f x (x) 
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LMC with R q = 2 and Q = 2, f 2 (x) 

Figure 5: Two samples from a linear model of coregionalization with R q = 2 and Q = 2. The solid lines represent 
one of the samples. The dashed lines represent the other sample. Samples are the weigthed sums of four latent 
functions, two of them share a covariance with a long length-scale and the other two share a covariance with a 
shorter length-scale. 

The functions w 9 (x) are considered to be latent factors and the semiparametric name comes from the fact 
that it is combining a nonparametric model, that is a Gaussian process, with a parametric linear mixing of the 
functions u 9 (x). The kernels k q , for each basic process is assumed to be exponentiated quadratic with a different 
characteristic length-scale for each input dimension. The informative vector machine (IVM) |57| is employed to 
speed up computations. 

Gaussian processes for Multi-task, Multi-output and Multi-class The intrinsic coregionalization model 
is considered by 1121 in the context of multitask learning. The authors use a probabilistic principal component 
analysis (PPCA) model to represent the matrix B. The spectral factorization in the PPCA model is replaced by 
an incomplete Cholesky decomposition to keep numerical stability. The authors also refer to the autokrigeability 
effect as the cancellation of inter- task transfer [12], and discuss the similarities between the multi-task GP and the 
ICM, and its relationship to the SLFM and the LMC. 

The intrinsic coregionalization model has also been used by 1721 . Here the matrix B is assumed to have a 
spherical parametrization, B = diag(e)S T S diag(e), where e gives a description for the scale length of each output 
variable and S is an upper triangular matrix whose i-th column is associated with particular spherical coordinates 
of points in R l (for details see sec. 3.4 [71 1). The scalar kernel k is represented through a Matern kernel, where 
different parameterizations allow the expression of periodic and non-periodic terms. Sparsification for this model 
is obtained using an IVM style approach. 

In a classification context, Gaussian processes methodology has been mostly restricted to the case where the 
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outputs are conditionally independent given the hyperparameters <p ]66 110 55 89l lll3ll82l . Therefore, the kernel 
matrix K(X, X) takes a block-diagonal form, with blocks given by (K(Xd, X.d))d,d- Correlation between the out- 
puts is assumed to exist implicitly by imposing the same set of hyperparameters 4> for all outputs and estimating 
those parameters, or directly the kernel matrices (K(X<j, ~X.d))d,d, using data from all the outputs l66. [55l[TT3ll82l . 
Alternatively, it is also possible to have parameters <pd associated to each output I110ll89l . 

Only recently, the intrinsic coregionalization model has been used in the multiclass scenario. In 1 93 1, the authors 
use the intrinsic coregionalization model for classification, by introducing a probit noise model as the likelihood. 
Since the posterior distribution is no longer analytically tractable, the authors use Gibbs sampling, Expectation- 
Propagation (EP) and variational Bayes0 to approximate the distribution. 

Computer emulation. A computer emulator is a statistical model used as a surrogate for a computationally 
expensive deterministic model or computer code, also known as a simulator. Gaussian processes have become 
the preferred statistical model among computer emulation practitioners (for a review see J 70]). Different Gaussian 
process emulators have been recently proposed to deal with several outputs [42 23 . 83 . 63 . 9. 79l. 

In 1 42 1, the linear model of coregionalization is used to model images representing the evolution of the implo- 
sion of steel cylinders after using TNT and obtained employing the so called Neddemeyer simulation model (see 
1421 for further details). The input variable x represents parameters of the simulation model, while the output is 
an image of the radius of the inner shell of the cylinder over a fixed grid of times and angles. In the version of 
the LMC that the authors employed, R q — I and the Q vectors a. q were obtained as the eigenvectors of a PCA 
decomposition of the set of training images. 

In 1 23 1, the intrinsic coregionalization model is employed for emulating the response of a vegetation model 
called the Sheffield Dynamic Global Vegetation Model (SDGVM) fTTTl . Authors refer to the ICM as the Multiple- 
Output (MO) emulator. The inputs to the model are ten (p = 10) variables related to broad soil, vegetation and 
climate data, while the outputs are time series of the net biome productivity (NBP) index measured at a particular 
site in a forest area of Harwood, UK. The NBP index accounts for the residual amount of carbon at a vegetation 
site after some natural processes have taken place. In the paper, the authors assume that the outputs correspond 
to the different sampling time points, so that D = T, being T the number of time points, while each observation 
corresponds to specific values of the ten input variables. Values of the input variables are chosen according to a 
maxi-min Latin hypercube design. 

Rougier (83l introduces an emulator for multiple-outputs that assumes that the set of output variables can be 
seen as a single variable while augmenting the input space with an additional index over the outputs. In other 
words, it considers the output variable as an input variable. (23), refers to the model in l83l as the Time Input 
(TI) emulator and discussed how the TI model turns out to be a particular case of the MO model that assumes a 
particular exponentiated quadratic kernel (see chapter 4 1 82 [) for the entries in the coregionalization matrix B. 

McFarland et al. ]63| consider a multiple-output problem as a single output one. The setup is similar to the 
one used in (23), where the number of outputs are associated to different time points, this is, D = T. The outputs 
correspond to the time evolutions of the temperature of certain location of a container with decomposing foam, 
as function of five different calibration variables (input variables in this context, p — 5). The authors use the time 
index as an input (akin to [831) and apply a greedy-like algorithm to select the training points for the Gaussian 
process. Greedy approximations like this one have also been used in the machine learning literature (for details, 
see (82), page 174). 

Similar to 1831 and 1631 , Bayarri et al. J9l use the time index as an input for a computer emulator that evaluates 
the accuracy of CRASH, a computer model that simulates the effect of a collision of a vehicle with different types 
of barriers. 

Quian et al. (79l propose a computer emulator based on Gaussian processes that supports quantitative and 
qualitative inputs. The covariance function in this computer emulator is related to the ICM in the case of one 
qualitative factor: the qualitative factor is considered to be the index of the output, and the covariance function 
takes again the form k(x, x')kT(d, d'). In the case of more than one qualitative input, the computer emulator could 
be considered a multiple output GP in which each output index would correspond to a particular combination of 
the possible values taken by the qualitative factors. In this case, the matrix B in ICM would have a block diagonal 
form, each block determining the covariance between the values taken by a particular qualitative input. 

'Mathematical treatment for each of these inference methods can be found in |10l , chapters fO and 11. 
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4.3 Extensions 



In this section we describe further developments related to the setting of separable kernels or SoS kernels, both 
from a regularization and a Bayesian perspective. 

4.3.1 Extensions Within the Regularization Framework 

When we consider kernels of the form K(x, x') = fc(x, x')B, a natural question is whether the matrix B can be 
learned from data. In a regression setting, one idea is to estimate B in a separate inference step as the covariance 
matrix of the output vectors in the training set and this is standard in the geostatistics literature |104|. A further 
question is whether we can learn both B and an estimator within a unique inference step. This is the question 
tackled in |48|. The authors consider a variation of the regularizer in i fMt and try to learn the cluster matrix as a 
part of the optimization process. More precisely the authors considered a regularization term of the form 

r r 

R(f)^e 1 \\f\\ k + e 2 J2 m ^\\fc-7\\l+^Y,J2 Wf l -fc\\l (22) 
o=i c=i iei(c) 

where we recall that r is the number of clusters. The three terms in the functional can be seen as: a global penalty, 
a term penalizing between cluster variance and a term penalizing within cluster variance. As in the case of the 
regularizer in (14), the above regularizer is completely characterized by a cluster matrix M, i.e. i?(f) = RM.(f) 
(note that the corresponding matrix B will be slightly different from fl5t). 
The idea is then to consider a regularized functional 

D 1 N 

E7v^ (/j(Xl) ^ l)2 + A - RM(f) (23) 

i=l i=l 

to be minimized jointly over f and M (see [48 1 for details). This problem is typically non tractable from a com- 
putational point of view, so the authors in [48 1 propose a relaxation of the problem which can be shown to be 
convex. 

A different approach is taken in (4) and |5). In this case the idea is that only a a small subset of features is useful 
to learn all the components /tasks. In the simplest case the authors propose to minimize a functional of the form 

D f 1 N 2 T 1 

1 Jj zZ^dU Xi - y d>i f + Xwjw d \ . 

d=l I i=l J 

over wi, . . . ,wu £ R P ,U € R D> '' D under the constraint TxQjJUt) < 7. Note that the minimization over the 
matrix U couples the otherwise disjoint component-wise problems. The authors of (U discuss how the above 
model is equivalent to considering a kernel of the form 

K(x., x') = A:d(x, x')Id, &d(x, x') = x t Dx' 

where D is a positive definite matrix and a model which can be described components wise as 

p 

/ d (x) = ad.jXj = ajx, 

making apparent the connection with the LMC model. In fact, it is possible to show that the above minimization 
problem is equivalent to minimizing 

D j N D 

*^2i TV 5Z( a d x ' — Vd.i) 2 + A ^ ajDa d , (24) 

d=l i=l d=l 

over al, . . . , a' D 6 R p and Tr(D) < \, where the last restriction is a convex approximation of the low rank re- 
quirement. Note that from a Bayesian perspective the above scheme can be interpreted as learning a covariance 
matrix for the response variables which is optimal for all the tasks. In (U, the authors consider a more general 
setting where D is replaced by F(D) and show that if the matrix valued function F is matrix concave, then the 
induced minimization problem is jointly convex in (aj) and D. Moreover, the authors discuss how to extend the 



19 



above framework to the case of more general kernel functions. Note that an approach similar to the one we just 
described is at the basis of recent work exploiting the concept of sparsity while solving multiple tasks. These latter 
methods cannot in general be cast in the framework of kernel methods and we refer the interested reader to 1691 
and references therein. 

For the reasoning above the key assumption is that a response variable is either important for all the tasks or 
not. In practice it is probably often the case that only certain subgroups of tasks share the same variables. This idea 
is at the basis of the study in |5|, where the authors design an algorithm to learn at once the group structure and 
the best set of variables for each groups of tasks. Let Q = (Gt)J = i be a partition of the set of components /tasks, 
where G t denotes a group of tasks and \G t \ < D. Then the author propose to consider a functional of the form 

f 1 N 2 T T 

m j n E arf m u t ^)N S (ad Ut Xi " Vd ' l)2 + XWd Wd 
G t es dea t I »=i 

+ 7 Tr(t/ f T t/t)j, 

where Ui , . . . Ut is a sequence of p by p matrices. The authors show that while the above minimization problem 
is not convex, stochastic gradient descent can be used to find local minimizers which seems to perform well in 
practice. 

4.3.2 Extensions from the Gaussian Processes Perspective 

A recent extension of the linear model of coregionalization expresses the output covariance function through a 
linear combination of nonorthogonal latent functions 1391 . In particular, the basic processes u^(x) are assumed to 
be nonorthogonal, leading to the following covariance function 

Q Q 

COv[f(x),f(x')] = ^ B 9,9 ,fc 9,9'( X ' X '), 
q=l q' = l 

where B g?? / are cross-coregionalization matrices. Cross-covariances k Qiq i (x, x') can be negative (while keeping pos- 
itive semidefiniteness for cov[f (x), f (x')]), allowing negative cross-covariances in the linear model of coregional- 
ization. The authors argue that, in some real scenarios, a linear combination of several correlated attributes are 
combined to represent a single model and give examples in mining, hydrology and oil industry 1391 . 



5 Beyond Separable Kernels 

Working with separable kernels or SoS kernels is appealing for their simplicity, but can be limiting in several 
applications. Next we review different types of kernels that go beyond the separable case or SoS case. 

5.1 Invariant Kernels 

Divergence free and curl free fields. The following two kernels are matrix valued exponentiated quadratic 
(EQ) kernels (68l and can be used to estimate divergence-free or curl-free vector fields 1 60 1 when the input and 
output space have the same dimension. These kernels induce a similarity between the vector field components 
that depends on the input points, and therefore cannot be reduced to the form K(x, x') = fc(x, x')B. 

We consider the case of vector fields with D = p, where X = R p . The divergence-free matrix-valued kernel 
can be defined via a translation invariant matrix-valued EQ kernel 

$0) = (VV T - V T VI)(j>(u) = H<j>{u) - tt[Htf>(u))lD , 

where H is the Hessian operator and <j> a scalar EQ kernel, so that K(x, x') := <E>(x — x'). 

The columns of the matrix valued EQ kernel, $, are divergence-free. In fact, computing the divergence of a 
linear combination of its columns, V t ($(m)c), with c € R p , it is possible to show that |7) 

V T ($(u)c) = (V T VV T 0(?i))c - (V T V T V<^(u))c = , 
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where the last equality follows applying the product rule of the gradient, the fact that the coefficient vector c does 
not depend upon u and the equality a T aa T — a T a T a, Va £ R p . 

Choosing a exponentiated quadratic, we obtain the divergence-free kernel 

K(x,x') = ie^V, (25) 

K(x, x') := *(x - x') = -VVV(x - x') = -H<f>(x - x') , 

where <f> is a scalar RBF. It is easy to show that the columns of * are curl-free. The j-th column of * is given by 
^ej, where ej is the standard basis vector with a one in the j-th position. This gives us 

$ c/e . = _w T $ c/ej = V(-V T $ c/ej ) = Vff , 

where g — —d(f>/dxj. The function g is a scalar function and the curl of the gradient of a scalar function is always 
zero. Choosing a exponentiated quadratic, we obtain the following curl-free kernel 

It is possible to consider a convex linear combination of these two kernels to obtain a kernel for learning any kind 
of vector field, while at the same time allowing reconstruction of the divergence-free and curl-free parts separately 
(see 1601 ). The interested reader can refer to I68II591I32I for further details on matrix-valued RBF and the properties 
of divergence-free and curl-free kernels. 

Transformable kernels. Another example of invariant kernels is discussed in 1 18 1 and is given by kernels 
defined by transformations. For the purpose of our discussion, let y — R D , Xo be a Hausdorff space and T d a 
family of maps (not necessarily linear) from X to X a for d = {1, . . . ,D} . 

Then, given a continuous scalar kernel k : Xo x Xo — > R, it is possible to define the following matrix valued 
kernel for any x, x' 6 X 

(K(x,x')) = k(T d x,T d ,x'). 

V / d,d' 

A specific instance of the above example is described by 11011 in the context of system identification, see also fT8l 
for further details. 



where 



^=11^ ^ I +((P 



The curl-free matrix valued kernels are obtained as 



5.2 Further Extensions of the LMC 

In |34j| , the authors introduced a nonstationary version of the LMC, in which the coregionalization matrices are 
allowed to vary as functions of the input variables. In particular, B g now depends on the input variable x, this is, 
B 9 (x, x'). The authors refer to this model as the spatially varying LMC (SVLMC). As a model for the varying core- 
gionalization matrix B 9 (x, x'), the authors employ two alternatives. In one of them, they assume that B 9 (x, x') = 
(a(x,x'))*'B q , where a(x) is a covariate of the input variable, and rp is a variable that follows a uniform prior. 
In the other alternative, B 9 (x, x') follows a Wishart spatial process, which is constructed using the definition of 
a Wishart distribution, as follows. Suppose Z E R DxF is a random matrix with entries z dtP ~ JV"(0, 1), indepen- 
dently and identically distributed, for d = 1, . . . , D and p = 1, . . . , P. Define the matrix T = TZ, with T G TL DxD . 
The matrix = TT T = TZZ T r T € R DxD follows a Wishart distribution W(P, IT T ), where P is known as 
the number of degrees of freedom of the distribution. The spatial Wishart process is constructed assuming that z diP 
depends on the input x, this is, Zd, p (x, x'), with z d , p (x., x') ~ Af(0, Pd(x, x')), where {p<i(x, x')}^! are correlation 
functions. Matrix Y(x,x') = TZ(x,x') and fi(x,x') = T(x, x')T T (x, x') = TZ(x, x')Z T (x, x')r T e R DxD 
follows a spatial Wishart process SW(P, FT T , {p d (x, x')} d=1 ). In (34), authors assume r = Id and p d (x, x') is 
the same for all values of d. Inference in this model is accomplished using Markov chain Monte Carlo. For details 
about the particular inference procedure, the reader is referred to |34|. 
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5.3 Process Convolutions 



More general non-separable kernels can also be constructed from a generative point of view. We saw in section 
14.2.11 that the linear model of coregionalization involves instantaneous mixing through a linear weighted sum 
of independent processes to construct correlated processes. By instantaneous mixing we mean that the output 



function f(x) evaluated at the input point x only depends on the values of the latent functions {u q (x)}® =1 at the 
same input x. This instantaneous mixing leads to a kernel function for vector-valued functions that has a separable 
form. 

A non-trivial way to mix the latent functions is through convolving a base process with a smoothing kernelQ If 
the base process is a Gaussian process, it turns out that the convolved process is also a Gaussian process. We can 
therefore exploit convolutions to construct covariance functions |8 102. 43. 44. 14 56)121. 

In a similar way to the linear model of coregionalization, we consider Q groups of functions, where a particular 
group q has elements u\ (z), for i = 1, . . . , R g . Each member of the group has the same covariance k q (x, x'), but is 
sampled independently. Any output /d(x) is described by 



where 



Q «g , Q 
/d( x ) = 5Z2Z / Gd ig (x-zK(z)dz + Kj d (x) = ^2fd( y 

q=l i=l J x g=l 

= £ / G^(x-z)<(z)dz, 

._i Jx 



Wd(x), 



(27) 



and {wa(]c)}d =1 are independent Gaussian processes with zero mean and covariance k Wd (x, x'). For the integrals 
in equation i27l to exist, it is assumed that each kernel G\ 9 (x) is a continuous function with compact support 
(46l or square-integrable I102U44I . The kernel Gj 9 (x) is also known as the moving average function |102| or the 
smoothing kernel l44l . We have included the superscript q for /J (x) in (27\ to emphasize the fact that the function 
depends on the set of latent processes {w 9 (x)}^ 1 . The latent functions u\(z) are Gaussian processes with general 
covariances fc 9 (x, x'). 

Under the same independence assumptions used in the linear model of coregionalization, the covariance be- 
tween /<j(x) and f d / (x') follows 

Q 

(K(x,x')) d , d / = (x, x') + k Wd (x, x')Sd,d' , (28) 



9=1 



where 



fc/!,/«,(x,x') = X1 / G d,9(x-z) / G d / ia (x'-z')/c,(z,z')dz'dz. (29) 

i=l •> x X 

Specifying G d q (x — z) and k q (z, z') in the equation above, the covariance for the outputs / d (x) can be constructed 
indirectly. Notice that if the smoothing kernels are taken to be the Dirac delta function in equation {29), such 
that G' d „(x — z) = a d 9 <5(x — z)Q the double integral is easily solved and the linear model of coregionalization 
is recovered. In this respect, process convolutions could also be seen as a dynamic version of the linear model of 
coregionalization in the sense that the latent functions are dynamically transformed with the help of the kernel 
smoothing functions, as opposed to a static mapping of the latent functions in the LMC case. See section l5.3.1l for 
a comparison between the process convolution and the LMC. 

A recent review of several extensions of this approach for the single output case is presented in |17|. Some 
of those extensions include the construction of nonstationary covariances (43 45 .30 31 .73 1 and spatiotemporal 
covariances I109II107|[T081 . 

The idea of using convolutions for constructing multiple output covariances was originally proposed by 1 102 1. 
They assumed that Q = 1, R q = 1, that the process u(x) was white Gaussian noise and that the input space was 



6 We use kernel to refer to both reproducing kernels and smoothing kernels. Smoothing kernels are functions which are convolved with a 
signal to create a smoothed version of that signal. 

7 We have slightly abused of the delta notation to indicate the Kronecker delta for discrete arguments and the Dirac function for continuous 
arguments. The particular meaning should be understood from the context. 
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X = R p . [44] depicted a similar construction to the one introduced by 11021 , but partitioned the input space into 
disjoint subsets X = f] , - Xd, allowing dependence between the outputs only in certain subsets of the input space 
where the latent process was common to all convolutions 

Higdon [44J coined the general moving average construction to develop a covariance function in equation {27} 
as a process convolution. 

Boyle and Frean 1141 1151 introduced the process convolution approach for multiple outputs to the machine 
learning community with the name of "dependent Gaussian processes" (DGP), further developed in [131- They 
allow the number of latent functions to be greater than one (Q > 1). In [56] and (2), the latent processes {u q (x)}^ =1 
followed a more general Gaussian process that goes beyond the white noise assumption. 

5.3.1 Comparison Between Process Convolutions and LMC 

Figure |6] shows an example of the instantaneous mixing effect obtained in the ICM and the LMC, and the non- 
instantaneous mixing effect due to the process convolution framework. We sampled twice from a two-output 
Gaussian process with an ICM covariance with R q = 1 (first column), an LMC covariance with R q — 2 (second 
column) and a process convolution covariance with R q = 1 and Q = 1 (third column). As in the examples for 
the LMC, we use EQ kernels for the basic kernels k q (x, x'). We also use an exponentiated quadraticform for the 
smoothing kernel functions G\ tl (x — x') and G\ A (yi — x') and assume that the latent function is white Gaussian 
noise. 

Notice from Figure |6] that samples from the ICM share the same length-scale. Samples from the LMC are 
weighted sums of functions with different length-scales (a long length-scale term and a short length-scale term). 
In both models, ICM and LMC, the two outputs share the same length-scale or the same combination of length- 
scales. Samples from the PC show that the contribution from the latent function is different over each output. 
Output fi(x) has a long length-scale behavior, while output f2(x) has a short length-scale behavior. 

It would be possible to get similar samples to the PC ones using a LMC. We would need to assume, though, 
that some covariances in a particular coregionalization matrix B q are zero. In Figure[7] we display samples from 
a LMC with R q = 2 and Q = 2. We have forced b( A — &f 2 = = 0. To generate these samples we use an ICM 
with R q = 2 and a latent function with long length-scale, and then add a sample from an independent Gaussian 
process with a short length-scale to output faix). It is debatable if this compound model (ICM plus independent 
GP) would capture the relevant correlations between the output functions. 

To summarize, the choice of a kernel corresponds to specifying dependencies among inputs and outputs. In 
the linear model of co-regionalization this is done considering separately inputs, via the kernels k q , and outputs, 
via the coregionalization matrices B 9 , for q = 1, . . . , Q. Having a large large value of Q allows for a larger ex- 
pressive power of the model. For example if the output components are substantially different functions (different 
smoothness or length scale), we might be able to capture their variability by choosing a sufficiently large Q. This 
is at the expense of a larger computational burden. 

On the other hand, the process convolution framework attempts to model the variance of the set of outputs by 
the direct association of a different smoothing kernel Gd(x) to each output /d(x). By specifying Gd(x), one can 
model, for example, the degree of smoothness and the length-scale that characterizes each output. If each output 
happens to have more than one degree of variation (marginally, it is a sum of functions of varied smoothness) one 
is faced with the same situation as in LMC, namely, the need to augment the parameter space so as to satisfy a 
required precision. However, due to the local description of each output that the process convolution performs, it 
is likely that the parameter space for the process convolution approach grows slower than the parameter space for 



5.3.2 Other Approaches Related to Process Convolutions 

In 1611 , a different moving average construction for the covariance of multiple outputs was introduced. It is 
obtained as a convolution over covariance functions in contrast to the process convolution approach where the 
convolution is performed over processes. Assuming that the covariances involved are isotropic and the only 
latent function tt(x) is a white Gaussian noise, 1611 show that the cross-co variance obtained from 



LMC. 




8 The latent process «(x) was assumed to be independent on these separate subspaces. 
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Figure 6: Two samples from three kernel matrices obtained using the intrinsic coregionalization model with R q = 1 
(first column), the linear model of coregionalization with R q = 2 (second column) and the process convolution 
formalism with R q = 1 and Q = 1 (third column). Solid lines represent one of the samples. Dashed lines represent 
the other sample. There are two outputs, one row per output. Notice that for the ICM and the LMC, the outputs 
have the same length-scale (in the ICM case) or combined length-scales (in the LMC case). The outputs generated 
from the process convolution covariance differ in their relative length-scale. 
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Figure 7: Two samples from a linear model of coregionalization with R q = 2 and Q = 2. The solid lines represent 
one of the samples. The dashed lines represent the other sample. Samples share a long length-scale behavior. An 
added short length-scale term appears only in output two. 



where fcd(h) and k d i(h) are covariances associated to the outputs d and d! , lead to a valid covariance function for 
the outputs {/d(x)}dLi. If we assume that the smoothing kernels are not only square integrable, but also posi- 
tive definite functions, then the covariance convolution approach turns out to be a particular case of the process 
convolution approach (square-integrability might be easier to satisfy than positive definiteness). 

1671 introduced the idea of transforming a Gaussian process prior using a discretized process convolution, 
f d = G d M, where G<j £ R JVxM is a so called design matrix with elements {G d (x n , z m )}^i J ^ m=1 and u T = 
[m(xi), . . . ,u(xm)]. Such a transformation could be applied for the purposes of fusing the information from mul- 
tiple sensors, for solving inverse problems in reconstruction of images or for reducing computational complexity 
working with the filtered data in the transformed space |92l . Convolutions with general Gaussian processes for 
modelling single outputs, were also proposed by 1301 1311 , but instead of the continuous convolution, 13011311 used 
a discrete convolution. The purpose in 1 30 31] was to develop a spatially varying covariance for single outputs, 
by allowing the parameters of the covariance of a base process to change as a function of the input domain. 

Process convolutions are closely related to the Bayesian kernel method l77l l58l construct reproducible kernel 
Hilbert spaces (RKHS) by assigning priors to signed measures and mapping these measures through integral 
operators. In particular, define the following space of functions, 

r={f\f(x) = f G(x,zy,(dz), 7 er}, 

for some space T C B{X) of signed Borel measures. In (77l proposition 1], the authors show that for r = B(X), 
the space of all signed Borel measures, T corresponds to a RKHS. Examples of these measures that appear in the 
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form of stochastic processes include Gaussian processes, Dirichlet processes and Levy processes. In principle, we 
can extend this framework for the multiple output case, expressing each output as 

f d (x) = / G d (x, z)7(dz). 
J x 

6 Inference and Computational Considerations 

Practical use of multiple-output kernel functions require the tuning of the hyperparameters, and dealing with the 
issue of computational burden related directly with the inversion of matrices of dimension ND x ND. Cross- 
validation and maximization of the log-marginal likelihood are alternatives for parameter tuning, while matrix 
diagonalization and reduced-rank approximations are choices for overcoming computational complexity of the 
matrix inversion. 

In this section we refer to the parameter estimation problem for the models presented in section H] and [5] and 
also to the computational complexity when using those models in practice. 

6.1 Estimation of Parameters in Regularization Theory 

From a regularization perspective, once the kernel is fixed, to find a solution we need to solve the linear system 
defined in ((5). The regularization parameter as well as the possible kernel parameters are typically tuned via cross- 
validation. The kernel free-parameters are usually reduced to one or two scalars (e.g. the width of a scalar kernel). 
While considering for example separable kernels the matrix B is fixed by design, rather than learned, and the only 
free parameters are those of the scalar kernel. 

Solving problem (5), this is c = (K(X, X) + A7VI) _1 y, is in general a costly operation both in terms of memory 
and time. When we have to solve the problem for a single value of A Cholesky decomposition is the method 
of choice, while when we want to compute the solution for different values of A (for example to perform cross 
validation) singular valued decomposition (SVD) is the method of choice. In both case the complexity in the worst 
case is 0(D 3 N 3 ) (with a larger constant for the SVD) and the associated storage requirement is 0(D 2 N 2 ) 

As observed in 1 7], this computational burden can be greatly reduced for separable kernels. For example, if we 
consider the kernel K(x, x') = k(x, x')I the kernel matrix K(X, X) becomes block diagonal. In particular if the 
input points are the same, all the blocks are equal and the problem reduces to inverting an TV by N matrix. The 
simple example above serves as a prototype for the more general case of a kernel of the form K(x, x') = fc(x, x')B. 
The point is that for this class of kernels, we can use the eigen-system of the matrix B to define a new coordinate 
system where the kernel matrix becomes block diagonal. 

We start observing that if we denote with (oi, m ),..., (o"d, ud) the eigenvalues and eigenvectors of B we can 
write the matrix C = (ci, . . . , cat), with c; G R D , as C = J^d=i ^ ® Ud > wnere c d = ((ci,u d ) D , (c N , u d ) D ) 
and ® is the tensor product and similarly Y = J2d=i V d ® Ud ' w ^ tn y d = ((yi> u d).o j ■ • • i (yjVi u d} D )- The above 
transformations are simply rotations in the output space. Moreover, for the considered class of kernels, the kernel 
matrix K(X, X) is given by the tensor product of the N x N scalar kernel matrix fc(X, X) and B, that is K(X, X) = 
B ® fc(X, X). 

Then we have the following equalities 

D 

C = (K(X, X) + ATAjvI) _1 Y = ^](B ® fe(X, X) + NXnI)' 1 ^ ® u d 

d=l 
D 

= ^2(<r d k(X, X) + NXs*)' 1 ?* ® Ud. 

d=l 

Since the eigenvectors Uj are orthonormal, it follows that: 

c d = (a d k(X,X) + N\ N iy 1 y J = ( k(X,X) + — l) (30) 

V o d ) o d 

for d — 1, . . . , D. The above equation shows that in the new coordinate system we have to solve D essentially 
independent problems after rescaling each kernel matrix by a d or equivalently rescaling the regularization param- 
eter (and the outputs). The above calculation shows that all kernels of this form allow for a simple implementation 
at the price of the eigen-decomposition of the matrix B. Then we see that the computational cost is now essentially 
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O(Z) 3 ) + 0(N 3 ) as opposed to 0(D 3 N 3 ) in the general case. Also, it shows that the coupling among the different 
tasks can be seen as a rotation and rescaling of the output points. Stegle et al. l94l also applied this approach in the 
context of fitting matrix variate Gaussian models with spherical noise. 



6.2 Parameters Estimation for Gaussian Processes 

In machine learning parameter estimation for Gaussian processes is often approached through maximization of the 
marginal likelihood. The method also goes by the names of evidence approximation, type II maximum likelihood, 
empirical Bayes, among others 1 10 1. 

With a Gaussian likelihood and after integrating f using the Gaussian prior, the marginal likelihood is given 

by 

p(y|X,0)=Af(y|O,K(X,X) + £), (31) 

where 4> are the hyperparameters. 

The objective function is the logarithm of the marginal likelihood 

logp(y|X, <t>) = -iy T (K(X, X)+£)- l y - ± log |K(X, X) + S| 

-^log27r. (32) 

The parameters <fi are obtained by maximizing logp(y|X, (f>) with respect to each element in (f>. Maximization is 
performed using a numerical optimization algorithm, for example, a gradient based method. Derivatives follow 



dlogp(y1X,4>) _ 1 T _— __i0K(X 1 X 



dcj>i 2 



= -y'K(X,X) ^-^K(X,X) y 



-i9K(X,X) 



- trace ( K(X, X) \^ ' ] , (33) 



where 4n is an element of the vector tfi and K(X, X) = K(X, X) + S. In the case of the LMC, in which the core- 
gionalization matrices must be positive semidefinite, it is possible to use an incomplete Cholesky decomposition 
B 9 = iqiiq , with L g G R D x Rq , as suggested in |12|. The elements of the matrices L q are considered part of the 
vector 4>. 

Another method used for parameter estimation, more common in the geostatistics literature, consists of op- 
timizing an objective function which involves some empirical measure of the correlation between the functions 
/d(x), K(x, x'), and the multivariate covariance obtained using a particular model, K(x, x') ['38 . 53. 76|. Assum- 
ing stationary covariances, this criteria reduces to 
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WSS=^w(h i ) trace H(K(hO - K(h;))] I, (34) 



where h; = x; — x.[ is a lag vector, w(hi) is a weight coefficient, K(h;) is an experimental covariance matrix with 
entries obtained by different estimators for cross-covariance functions 175111021 , and K(h 4 ) is the covariance matrix 
obtained, for example, using the linear model of coregionalization|3 One of the first algorithms for estimating the 
parameter vector 4> in LMC was proposed by [38). It assumed that the parameters of the basic covariance functions 
k q (x, x') had been determined a priori and then used a weighted least squares method to fit the coregionalization 
matrices. In 1 76 1 the efficiency of other least squares procedures was evaluated experimentally, including ordinary 
least squares and generalized least squares. Other more general algorithms in which all the parameters are esti- 
mated simultaneously include simulated annealing (54l and the EM algorithm 11141 . Ver Hoef and Barry 11021 also 
proposed the use of an objective function like {34), to estimate the parameters in the covariance obtained from a 
process convolution. 



9 Note that the common practice in geostatistics is to work with variograms instead of covariances. A variogram characterizes a general 
class of random functions known as intrinsic random functions 1621 , which are random processes whose increments follow a stationary second- 
order process. For clarity of exposition, we will avoid the introduction of the variogram and its properties. The interested reader can follow 
the original paper by [62] for a motivation of their existence, 1361 for a comparison between variograms and covariance functions and 1371 for 
a definition of the linear model of coregionalization in terms of variograms. 
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Both methods described above, the evidence approximation or the least-square method, give point estimates 
of the parameter vector <p. Several authors have employed full Bayesian inference by assigning priors to <j> and 
computing the posterior distribution through some sampling procedure. Examples include [42] and [23] under 
the LMC framework or (TH and (99) under the process convolution approach. 

As mentioned before, for non-Gaussian likelihoods, there is not a closed form solution for the posterior distri- 
bution nor for the marginal likelihood. However, the marginal likelihood can be approximated under a Laplace, 
variational Bayes or expectation propagation (EP) approximation frameworks for multiple output classification 
I93H11I . and used to find estimates for the hyperparameters. Hence, the error function is replaced for log q(y |X, </>), 
where g(y|X, <f>) is the approximated marginal likelihood. Parameters are again estimated using a gradient based 
methods. 

The problem of computational complexity for Gaussian processes in the multiple output context has been 
studied by different authors 1 83 . 103 96 13 2 1|. Fundamentally, the computational problem is the same than 
the one appearing in regularization theory, that is, the inversion of the matrix K(X, X) = K(X, X) + S for solv- 
ing equation (5). This step is necessary for computing the marginal likelihood and its derivatives (for estimating 
the hyperparameters as explained before) or for computing the predictive distribution. With the exception of the 
method by |83 |, the approximation methods proposed in 11031 l%l [131 l2l HI can be applied to reduce computa- 
tional complexity, whichever covariance function (LMC or process convolution, for example) is used to compute 
the multi-output covariance matrix. In other words, the computational efficiency gained is independent of the 
particular method employed to compute the covariance matrix. 

Before looking with some detail at the different approximation methods employed in the Gaussian processes 
literature for multiple outputs, it is worth mentioning that computing the kernel function through process con- 
volutions in equation (29) implies solving a double integral, which is not always feasible for any choice of the 
smoothing kernels G d>q {-) and covariance functions fc 9 (x, x'). An example of an analytically tractable covariance 
function occurs when both the smoothing kernel and the covariance function for the latent functions have EQ 
kernels |2|, or when the smoothing kernels have an exponentiated quadratic form and the latent functions are 
Gaussian white noise processes 1731 H3I . An alternative would be to consider discrete process convolutions l44l 
instead of the continuous process convolution of equations ( f28i and (29), avoiding in this way the need to solve 
double integrals. 

We now briefly summarize different methods for reducing computational complexity in multi-output Gaussian 
processes. 

As we mentioned before, Rougier (83] assumes that the multiple output problem can be seen as a single output 
problem considering the output index as another variable of the input space. The predicted output, /(x») is 
expressed as a weighted sum of Q deterministic regressors that explain the mean of the output process plus a 
Gaussian error term that explains the variance in the output. Both, the set of regressors and the covariance for 
the error are assumed to be separable in the input space. The covariance takes the form fc(x, x.')kr(d, d'), as 
in the introduction of section [4] For isotopic models (|83| refers to this condition as regular outputs, meaning 
outputs that are evaluated at the same set of inputs X), the mean and covariance for the output, can be obtained 
through Kronecker products for the regressors and the covariances involved in the error term. For inference 
the inversion of the necessary terms is accomplished using properties of the Kronecker product. For example, 
if K(X,X') = B ® fc(X,X'), then K _1 (X,X') = B _1 ® /^(X.X'). Computational complexity is reduced to 
O(Z) 3 ) + 0(N 3 ), similar to the eigendecomposition method in section[6j] 

Ver Hoef and Barry [103 ] present a simulation example with D — 2. Prediction over one of the variables is per- 
formed using cokriging. In cokriging scenarios, usually one has access to a few measurements of a primary variable, 
but plenty of observations for a secondary variable. In geostatistics, for example, predicting the concentration of 
heavy pollutant metals (say Cadmium or Lead), which are expensive to measure, can be done using inexpensive 
and oversampled variables as a proxy (say pH levels) 1371 . Following a suggestion by l95l (page 172), the authors 
partition the secondary observations into subgroups of observations and assume the likelihood function is the 
sum of the partial likelihood functions of several systems that include the primary observations and each of the 
subgroups of the secondary observations. In other words, the joint probability distribution p(/i(Xi), /2(X2)) is 
factorised asp(/i(Xi), / 2 (X 2 )) = n/=i p(/i(Xi), / 2 (j) (X^)), where / 2 (j) (X< j) ) indicates the observations in the 
subgroup j out of J subgroups of observations, for the secondary variable. Inversion of the particular covariance 
matrix derived from these assumptions grows as 0(JN 3 ), where TV is the number of input points per secondary 
variable. 

Also, the authors use a fast Fourier transform for computing the autocovariance matrices (K(Xd, X.d))d,d and 
cross-covariance matrices (K(X<j, X d /)) d d /. 

Boyle 1 13 1 proposed an extension of the reduced rank approximation method presented by (80), to be applied 
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to the dependent Gaussian process construction. The author outlined the generalization of the methodology for 
D = 2. The outputs /i(Xi) and /2(X2) are defined as 

(K(Xi,Xi))i,i (K(Xi,Xa))i,a' 
(K(X 2 ,Xi)) 2 ,i (K(X 2 ,X 2 )) 2 , 2 

where w d are vectors of weights associated to each output including additional weights corresponding to the test 
inputs, one for each output. Based on this likelihood, a predictive distribution for the joint prediction of ft (X) and 
/ 2 (X) can be obtained, with the characteristic that the variance for the approximation, approaches to the variance 
of the full predictive distribution of the Gaussian process, even for test points away from the training data. The 
elements in the matrices (K(X<j, X d /)) d?d / are computed using the covariances and cross-covariances developed 
in sections H] and |5] Computational complexity reduces to 0(DNM 2 ), where N is the number of sample points 
per output and M is an user specified value that accounts for the rank of the approximation. 

In 1 2 1, the authors show how through making specific conditional independence assumptions, inspired by the 
model structure in the process convolution formulation (for which the LMC is a special case), it is possible to 
arrive at a series of efficient approximations that represent the covariance matrix K(X, X) using a reduced rank 
approximation Q plus a matrix D, where D has a specific structure that depends on the particular independence 
assumption made to obtain the approximation. Approximations can reduce the computational complexity to 
0(NDM 2 ) with M representing a user specified value that determines the rank of Q. Approximations obtained 
in this way, have similarities with the conditional approximations summarized for a single output in (81). 

Finally, the informative vector machine (IVM) [57| has also been extended to Gaussian processes using kernel 
matrices derived from particular versions of the linear model of coregionalization, including |96| and |55|. In the 
IVM, only a smaller subset of size M of the data points is chosen for constructing the GP predictor. The data 
points selected are the ones that maximize a differential entropy score |55l or an information gain criteria (96). 
Computational complexity for this approximation is again 0(NDM 2 ). 

For the computational complexities shown above, we assumed R q = 1 and Q = 1. 



7 Applications of Multivariate Kernels 

In this chapter we further describe in more detail some of the applications of kernel approaches to multi-output 
learning from the statistics and machine learning communities. 

One of the main application areas of multivariate Gaussian process has been in computer emulation. In (29), the 
LMC is used as the covariance function for a Gaussian process emulator of a finite-element method that solves for 
frequency response functions obtained from a structure. The outputs correspond to pairs of masses and stiffnesses 
for several structural modes of vibration for an aircraft model. The input space is made of variables related to 
physical properties, such as Tail tip mass or Wingtip mass, among others. 

Multivariate computer emulators are also frequently used for modelling time series. We mentioned this type 
of application in section l4.2.4l Mostly, the number of time points in the time series are matched to the number of 
outputs (we expressed this as D = T before), and different time series correspond to different input values for 
the emulation. The particular input values employed are obtained from different ranges that the input variables 
can take (given by an expert), and are chosen according to some space-filling criteria (Latin hypercube design, for 
example) l84l . In 1231 , the time series correspond to the evolution of the net biome productivity (NBP) index, which 
in turn is the output of the Sheffield dynamic global vegetation model. In [63], the time series is the temperature of 
a particular location of a container with decomposing foam. The simulation model is a finite element model and 
simulates the transfer of heat through decomposing foam. 

In machine learning the range of applications for multivariate kernels is increasing. In 1721 , the ICM is used 
to model the dependencies of multivariate time series in a sensor network. Sensors located in the south coast of 
England measure different environmental variables such as temperature, wind speed, tide height, among others. 
Sensors located close to each other make similar readings. If there are faulty sensors, their missing readings could 
be interpolated using the healthy ones. 

In 1 22 1, the authors use the ICM for obtaining the inverse dynamics of a robotic manipulator. The inverse 
dynamics problem consists in computing the torques at different joints of the robotic arm, as function of the angle, 
angle velocity and angle acceleration for the different joints. Computed torques are necessary to drive the robotic 
arm along a particular trajectory. Furthermore, the authors consider several contexts, this is, different dynamics due 
to different loadings at the end effector. Joints are modelled independently using an ICM for each of them, being 
the outputs the different contexts and being the inputs, the angles, the angle velocities and the angle accelerations. 
Besides interpolation, the model is also used for extrapolation of novel contexts. 
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The authors of fTTI use the ICM for preference elicitation, where a user is prompted to solve simple queries in 
order to receive a recommendation. The ICM is used as a covariance function for a GP that captures dependencies 
between users (through the matrix B), and dependencies between items (through the covariance fc(x, x')). 

In 1 56 1 and |33|, the authors use a process convolution to model the interaction between several genes and a 
transcription factor protein, in a gene regulatory network. Each output corresponds to a gene, and each latent 
function corresponds to a transcription factor protein. It is assummed that transcription factors regulate the rate at 
which particular genes produce primary RNA. The output functions and the latent functions are indexed by time. 
The smoothing kernel functions G' d q (-) correspond to the impulse response obtained from an ordinary differential 
equation of first order. Given gene expression data, the problem is to infer the time evolution of the transcription 
factor. 

In |3), the authors use a process convolution to model the dependencies between different body parts of an 
actor that performs modern dancing movements. This type of data is usually known as mocap (for motion capture) 
data. The outputs correspond to time courses of angles referenced to a root node, for each body part modelled. 
The smoothing kernel used corresponds to a Green's function arising from a second order ordinary differential 
equation. 

In [67 1, the authors use a discretized process convolution for solving an inverse problem in reconstruction of 
images, and for fusing the information from multiple sensors. 

In (16*1 , two particulate matter (PM) levels measured in the air (10 /im in diameter and 2.5 /im in diameter), at 
different spatial locations, are modeled as the added influence of coarse and fine particles. In turn, these coarse 
and fine particles are modeled as random walks and then transformed by discrete convolutions to represent the 
levels of PM at 10 /im and 2.5 /im. The objective is to extract information about PM at 2.5 /im from the abundant 
readings of PM at 10 /im. 

8 Discussion 

We have presented a survey of multiple output kernel functions to be used in kernel methods including regular- 
ization theory and Gaussian processes. From the regularization theory point of view, the multiple output problem 
can be seen as a regularization method that minimizes directly a loss function while constraining the parameters 
of the learned vector-valued function. In a Gaussian process framework, from a machine learning context, the 
multiple output problem is equivalent to formulate a generative model for each output that expresses correlations 
as functions of the output function index and the input space, using a set of common latent functions. 

We presented two general families of kernels for vector-valued functions including the separable family (in- 
cluding the SoS kernels) and different instantiations of what we would call the nonseparable family. The separable 
family represent the kernel function as the product of a kernel for the outputs, independently of the value that the 
input can have, and a kernel function for the input space, independently of the output index. The most general 
model is the linear model of coregionalization, with many other kernel functions that appear in the machine learn- 
ing literature as particular cases. Within the family of nonseparable kernels, the process convolution construction 
has proved useful for several theoretical and applied problems and as we have seen before, it can be considered as 
a generalization of the linear model of coregionalization. 

Model selection establishes a path for future research in multiple-output kernels related problems. From a 
Bayesian perspective, in the setup of LMC and process convolutions, model selection includes principled mech- 
anisms to find the number of latent functions and/or the rank of the coregionalization matrices. More general 
model selection problems involve the ability to test if given some data, the outputs are really correlated or influ- 
ence each other, compared to the simpler assumption of independence. Other model selection problem includes 
the influence of the input space configuration (isotopic against heterotopic) towards the sharing of strengths be- 
tween outputs. Although such problems have been studied to some extent in the geostatistics literature, there 
remain open issues. 
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Notation 



Generalities 

p dimensionality of the input space 

D number of outputs 

N, Nd number of data points for output d 

Q number of latent functions (for generative models) 

X input space 

X d input training data for output d, X d = {x<i,n} n =i 
X input training data for all outputs, X = {Xd}J =1 

Functions 

fc(x, x') general scalar kernel 

K(x, x') general kernel valued matrix with entries (K(x, x')) d d , with d, d = 1, . . . , D 
k q (x, x') scalar kernel for the q— th latent function 
/d(x) rf-th output evaluated at x 

f (x) , vector-valued function, f (x) = [/i (x), . . . , /d(x)] t 
Sk,k' Kronecker delta for discrete arguments 
5(x) Dirac delta for continuous arguments 

Vectors and matrices 

k q (X, X) kernel matrix with entries k q (x, x') evaluated at X 
f d (X d ) f d (x) evaluated at X d , t d = [f d {x-d,i), /d(x d ,jv d )] T 
f (X) vectors {id}d =1 , stacked in a column vector 

(K(X d , X d ,)) d d , kernel matrix with entries (K(x d ,„, x. d >, m ))d,d' witn x d,« G x d and x d / jm E X d / 
K(X,X) kernel matrix with blocks (K(X d , X d ,)) d d , with d, d' = 1, . . . , D 
I at identity matrix of size TV 
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